This report explores a dataset containing results of the Test of English for International Communication (TOEIC) for approximately 2,000 enrollments of the Faculty of Business and Economics of the University of Chile. It also contains their results on the Chilean Higher Education Selection Exam (PSU), as well as other attributes related to their high schools.


Data Set

## [1] 1956   10
## 'data.frame':    1956 obs. of  10 variables:
##  $ year       : int  2010 2010 2010 2010 2010 2010 2010 2010 2010 2010 ...
##  $ gender     : Factor w/ 2 levels "female","male": 1 1 2 1 1 2 2 2 2 2 ...
##  $ hs.type    : Factor w/ 3 levels "private","public",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ hs.location: Factor w/ 2 levels "capital_city",..: 2 1 1 1 1 1 1 1 1 1 ...
##  $ hs.score   : int  744 744 682 744 682 723 785 682 702 744 ...
##  $ math       : int  688 718 774 756 706 774 850 737 673 850 ...
##  $ history    : int  NA 694 708 694 735 660 772 757 NA NA ...
##  $ science    : int  708 NA 638 NA 708 614 642 NA 672 727 ...
##  $ spanish    : int  743 776 710 732 801 721 700 754 682 792 ...
##  $ toeic      : num  677 673 667 667 667 660 657 657 653 653 ...
##       year         gender            hs.type           hs.location  
##  Min.   :2005   female: 791   private    :1017   capital_city:1562  
##  1st Qu.:2006   male  :1165   public     : 388   other_region: 394  
##  Median :2008                 semiprivate: 551                      
##  Mean   :2008                                                       
##  3rd Qu.:2009                                                       
##  Max.   :2010                                                       
##                                                                     
##     hs.score          math          history         science     
##  Min.   :435.0   Min.   :580.0   Min.   :432.0   Min.   :166.0  
##  1st Qu.:661.0   1st Qu.:694.0   1st Qu.:630.0   1st Qu.:606.0  
##  Median :702.0   Median :719.0   Median :677.0   Median :645.5  
##  Mean   :694.8   Mean   :726.5   Mean   :678.6   Mean   :642.7  
##  3rd Qu.:744.0   3rd Qu.:756.0   3rd Qu.:726.0   3rd Qu.:680.0  
##  Max.   :826.0   Max.   :850.0   Max.   :850.0   Max.   :824.0  
##                                  NA's   :462     NA's   :716    
##     spanish          toeic      
##  Min.   :440.0   Min.   :217.0  
##  1st Qu.:639.0   1st Qu.:407.0  
##  Median :674.0   Median :467.0  
##  Mean   :678.7   Mean   :461.6  
##  3rd Qu.:717.0   3rd Qu.:528.5  
##  Max.   :831.0   Max.   :677.0  
## 

Our dataset consists of ten variables, with almost 2,000 observations.

Univariate Plots Section

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   217.0   407.0   467.0   461.6   528.5   677.0

The TOEIC distribution appears to be unimodal, with the score peaking around 475. It’s important to note that the scale of this test goes from 10 to 990, so the mean (461) is quite low.

Regarding the categorical variables, we can see that most enrollments are male and come from private high schools located in the capital city, Santiago.

It seems like hs.score has a semi-discrete distribution shape. This makes sense given that it is the result of a transformation from a different scale of scores (from 1 to 7). This is done by the National Education Ministry to make high school scores easier to compare with the higher education selection exam’s results.

It is important to say that for the high school score, as well as for the national tests of maths, history, science and spanish scores, the scale of scores goes from 350 to 850.

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   580.0   694.0   719.0   726.5   756.0   850.0

It’s an interesting distribution: the left tail looks like that of a continuous variable, but the closer it gets to the maximum score (850), the more discrete its behaviour becomes. This makes sense given that the test is only 70 questions long and each incorrect answer costs relatively more points when you have fewer of them.

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##   432.0   630.0   677.0   678.6   726.0   850.0     462

This distribution seems more “normal”, given that the median and mean (677 and 679) are further away from the maximum test score (850) than in the case of the math scores.

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##   166.0   606.0   645.5   642.7   680.0   824.0     716

It looks quite unimodal, with a peak around 650. Just as before, the further from the maximum score, the more continuous and normal the distribution looks.

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   440.0   639.0   674.0   678.7   717.0   831.0

Similar case to math’s distribution.

It seems that most test results have a similar distribution shape, peaking around 650 points. The exception is math, where the peak is around 725 points. This makes sense, given that mathematics has a considerably higher relative weight when enrolling into the Faculty.

Having said that, at this point I should mention that the net enrollment score is calculated by the following equation:

\(\text{enrollment.score} = 0.5 \cdot \text{math} + 0.2 \cdot \text{hs.score} + 0.2 \cdot \max(\text{history}, \text{science}) + 0.1 \cdot \text{spanish}\)

I am interested in that variable as well, so I will include it in the rest of the analysis. Note that the TOEIC test does not take part in this equation, as it is not a requirement and is only undertaken to assess already enrolled students.
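A minimal sketch of how this variable could be computed in R. It uses only the first three rows shown in the `str()` output above, and it assumes that a test that was not taken is stored as `NA`. The names `toy` and `best.elective` are illustrative and not part of the original analysis.

```r
# First three rows from the str() output above (illustration only).
toy <- data.frame(
  hs.score = c(744, 744, 682),
  math     = c(688, 718, 774),
  history  = c(NA, 694, 708),
  science  = c(708, NA, 638),
  spanish  = c(743, 776, 710)
)

# pmax() takes the element-wise maximum; na.rm = TRUE skips a test
# that was not taken (assumed to be stored as NA).
best.elective <- pmax(toy$history, toy$science, na.rm = TRUE)

# Weighted sum following the enrollment equation.
toy$enrollment.score <- 0.5 * toy$math +
  0.2 * toy$hs.score +
  0.2 * best.elective +
  0.1 * toy$spanish

print(toy$enrollment.score)
```

With `na.rm = TRUE`, a row where both history and science are missing would still yield `NA`, which seems the safest behaviour for this formula.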


Univariate Analysis

What is the structure of your dataset?

There are 1,956 enrollments in the dataset with 10 variables (year, gender, hs.type, hs.location, hs.score, math, history, science, spanish and toeic).

Other observations:

  • There are about 47% more male than female enrollments (1,165 vs. 791).
  • The median spanish result is 674.
  • Most enrollments come from a high school located in the capital city.
  • Most enrollments correspond to the years 2007 to 2010.

What is/are the main feature(s) of interest in your dataset?

The main features in the data set are toeic and spanish. I’d like to determine which features are best for predicting results on the TOEIC test undertaken by new enrollments. I suspect spanish test score and some combination of the other variables can be used to build a predictive model for TOEIC results.

What other features in the dataset do you think will help support your investigation into your feature(s) of interest?

Hs.location and hs.score likely contribute to the level of English of a recent graduate. I think hs.type (private, public or semiprivate) and enrollment.score probably contribute most to the TOEIC results, as they could reflect a level of self-confidence or higher general knowledge when completing the test.

Did you create any new variables from existing variables in the dataset?

I created a variable for the final enrollment score (enrollment.score) using the other tests’ scores and their corresponding relative weights. This arose in the univariate section of my analysis, when I realised that the TOEIC test is undertaken after the recent graduates are already enrolled and know their final enrollment score, which could play a self-confidence role when completing the TOEIC test.

Of the features you investigated, were there any unusual distributions? Did you perform any operations on the data to tidy, adjust, or change the form of the data? If so, why did you do this?

The only unusual thing was how the continuous variables tended to behave in a discrete way when approaching the highest possible value. This makes sense, though, given that the test results are constructed to separate the whole population.

In the enrollment process, only the highest score between history and science was considered to calculate the net enrollment score. This is why there were so many 0 values within these two tests. I transformed these values to NAs.
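The recoding described above could look roughly like the following. This is only a sketch on made-up rows; the original code is not shown in this report, and `toy` is a hypothetical stand-in for the real data frame.

```r
# Made-up rows; 0 marks a test that was not taken.
toy <- data.frame(
  history = c(0, 694, 708),
  science = c(708, 0, 638)
)

# Replace the 0 placeholders with NA in both elective tests.
# which() keeps the assignment safe if a column already contains NAs.
for (col in c("history", "science")) {
  toy[[col]][which(toy[[col]] == 0)] <- NA
}

print(colSums(is.na(toy)))
```

Treating the zeros as missing keeps them from dragging down summaries such as the minimum and the mean.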


Bivariate Plots Section

##                  hs.score  math history science spanish toeic
## hs.score             1.00 -0.39   -0.06    0.00    0.07  0.03
## math                -0.39  1.00    0.00    0.13    0.02  0.07
## history             -0.06  0.00    1.00    0.01    0.40  0.15
## science              0.00  0.13    0.01    1.00    0.21  0.11
## spanish              0.07  0.02    0.40    0.21    1.00  0.25
## toeic                0.03  0.07    0.15    0.11    0.25  1.00
## enrollment.score     0.09  0.73    0.43    0.36    0.43  0.20
##                  enrollment.score
## hs.score                     0.09
## math                         0.73
## history                      0.43
## science                      0.36
## spanish                      0.43
## toeic                        0.20
## enrollment.score             1.00

Of all the variables, toeic correlates most strongly with spanish (0.25), which was my suspicion, although the correlation is modest in absolute terms. However, to my surprise, hs.score barely correlates with toeic (0.03).

Spanish also correlates with history (0.40), which makes sense. It also correlates with science, in a weaker way (0.21).

There’s a clear negative correlation between math and hs.score (-0.39), which is weird.

Math and science do not seem to have strong correlations with toeic (0.07 and 0.11).
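Because history and science contain many NAs, a correlation matrix like the one above needs a rule for missing values. The sketch below uses `use = "pairwise.complete.obs"` on simulated data; which option the original analysis used is not stated here, so this is an assumption, and all values are randomly generated.

```r
set.seed(1)
n <- 50

# Simulated scores (not the real data), with some history values missing.
toy <- data.frame(
  spanish = rnorm(n, mean = 680, sd = 50),
  toeic   = rnorm(n, mean = 460, sd = 85),
  history = rnorm(n, mean = 680, sd = 60)
)
toy$history[1:10] <- NA

# Each pair of columns uses the rows where both values are present.
cors <- round(cor(toy, use = "pairwise.complete.obs"), 2)
print(cors)
```

With pairwise deletion, each coefficient may be based on a different number of rows, which is worth keeping in mind when comparing them.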

I want to take a closer look at scatter plots involving toeic and some other continuous variables like spanish, hs.score, history and enrollment.score.

There seems to be a lot of noise, but there’s definitely a positive relationship between spanish and toeic scores.

Nope. Even after adding jitter, transparency, and changing the size of the points, there doesn’t seem to be any relation between toeic and hs.score.

There is definitely a positive correlation, but the slope is not as steep as with spanish.

This one was harder to see, so on top of the jitter, size and transparency, I utilised a smooth line (linear model) to establish the relation between toeic and enrollment.score.

Before I move on, I want to take another look at the relationship between hs.score and math.

There is a clear negative correlation. After thinking about this for a while, I guess it makes sense considering the enrollment criteria (formula).

Next, I’ll have a closer look at how the categorical features vary with toeic.

## enrollments$hs.location: capital_city
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   217.0   410.0   470.0   463.7   533.0   673.0 
## -------------------------------------------------------- 
## enrollments$hs.location: other_region
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   237.0   393.0   457.0   453.3   519.8   677.0

It seems like enrollments from high schools located in the capital city have slightly higher scores than those from other regions. There is a difference of 13 points between the median values of both groups (470 vs. 457), which is smaller than I was expecting.

## enrollments$hs.type: private
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   223.0   443.0   500.0   493.8   551.0   677.0 
## -------------------------------------------------------- 
## enrollments$hs.type: public
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   217.0   380.0   433.0   431.4   487.0   620.0 
## -------------------------------------------------------- 
## enrollments$hs.type: semiprivate
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   217.0   360.0   430.0   423.4   490.0   630.0

There is a difference of 67 points in the median toeic value between private and public high schools, which is quite considerable.

## enrollments$gender: female
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   217.0   397.0   453.0   451.9   513.0   677.0 
## -------------------------------------------------------- 
## enrollments$gender: male
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   217.0   413.0   477.0   468.2   537.0   667.0

It seems like male scores are slightly higher than female scores, with a difference of 24 points between the group medians.
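The grouped summaries above have the shape produced by `by()`. As a minimal sketch on made-up scores, the same call plus `tapply()` gives the gap between group medians directly; `toy`, `med` and `gap` are illustrative names only.

```r
# Made-up scores for two groups (not the real data).
toy <- data.frame(
  gender = factor(c("female", "female", "female", "male", "male", "male")),
  toeic  = c(400, 450, 500, 420, 480, 540)
)

# Grouped five-number summaries, in the same format as the output above.
print(by(toy$toeic, toy$gender, summary))

# Median per group and the difference between them.
med <- tapply(toy$toeic, toy$gender, median)
gap <- unname(med["male"] - med["female"])
print(gap)
```

Computing the gap in code rather than by eye avoids arithmetic slips when reading values off the printed summaries.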


Bivariate Analysis

Talk about some of the relationships you observed in this part of the investigation. How did the feature(s) of interest vary with other features in the dataset?

Toeic correlates most strongly with spanish (0.25) and slightly with enrollment.score (0.20) and history (0.15). Also, hs.type has a considerable effect on toeic results.

Gender and hs.location have a smaller but noticeable impact on toeic results. On the other hand, hs.score, math, and science do not have strong correlations with toeic.

Spanish correlates with history (0.40) and also, to a lesser degree, with science (0.21).

Did you observe any interesting relationships between the other features (not the main feature(s) of interest)?

Math and hs.score have a clear negative correlation (-0.39). This could be explained by the fact that these scores have the two largest coefficients in the net enrollment score equation. Considering also that most of the data is located at the very top of the tests’ scale, it is likely that a given enrollment has either a high score in math or a high score at high school, but not both.

What was the strongest relationship you found?

Enrollments’ toeic test results are positively correlated with spanish results (0.25), the strongest relationship I found for toeic. With less strength, toeic also correlates with enrollment.score and history.


Multivariate Plots Section

Given the correlation between toeic and spanish, I created a ratio between these two in order to establish a “general linguistic” measurement. Then I wanted to see how the three categorical variables (hs.type, gender and hs.location) are distributed along this ratio.

Now let’s see how these variables affect the relation between toeic and spanish, in order to try and build a predictive model.

There is a trend of higher toeic results for private high school enrollments, although this trend is not very clear when looking at high schools from other regions.

There is a small trend of higher scores for male enrollments, although not as strong as with hs.type.
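A minimal sketch of this ratio on made-up rows. The column name `ratio` is hypothetical, since the name used in the original analysis does not appear in this report.

```r
# Made-up rows (not the real data).
toy <- data.frame(
  toeic   = c(500, 450, 400, 420),
  spanish = c(700, 650, 680, 640),
  hs.type = factor(c("private", "private", "public", "public"))
)

# "General linguistic" measurement: TOEIC score relative to Spanish score.
toy$ratio <- toy$toeic / toy$spanish

# Median ratio per high school type.
med <- tapply(toy$ratio, toy$hs.type, median)
print(round(med, 3))
```

Note that the two tests use different scales (10 to 990 for TOEIC), so the ratio is only meaningful for comparing groups, not as an absolute quantity.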

These plots suggest that we can build a linear model and use those variables in the linear model to predict the enrollment’s TOEIC results.

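# mtable() comes from the memisc package
library(memisc)

# Create variables corresponding to each different model
# (enrollment.score must already be a column of enrollments)
m1 <- lm(toeic ~ spanish, data = enrollments)
m2 <- update(m1, ~ . + hs.type)
m3 <- update(m2, ~ . + gender)
m4 <- update(m3, ~ . + enrollment.score)
# history contains NAs, so m5 and m6 are fitted on fewer rows
m5 <- update(m4, ~ . + history)
m6 <- update(m5, ~ . + hs.location)

# Table the results for each model
mtable(m1, m2, m3, m4, m5, m6, sdigits = 3)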
## 
## Calls:
## m1: lm(formula = toeic ~ spanish, data = enrollments)
## m2: lm(formula = toeic ~ spanish + hs.type, data = enrollments)
## m3: lm(formula = toeic ~ spanish + hs.type + gender, data = enrollments)
## m4: lm(formula = toeic ~ spanish + hs.type + gender + enrollment.score, 
##     data = enrollments)
## m5: lm(formula = toeic ~ spanish + hs.type + gender + enrollment.score + 
##     history, data = enrollments)
## m6: lm(formula = toeic ~ spanish + hs.type + gender + enrollment.score + 
##     history + hs.location, data = enrollments)
## 
## ==============================================================================================================================================
##                                                 m1               m2               m3               m4               m5              m6        
## ----------------------------------------------------------------------------------------------------------------------------------------------
##   (Intercept)                                  195.090***       232.126***       219.250***       143.336**       141.579**       140.950**   
##                                                (23.417)         (21.792)         (21.952)         (48.479)        (53.502)        (53.460)    
##   spanish                                        0.393***         0.385***         0.391***         0.364***        0.358***        0.357***  
##                                                 (0.034)          (0.032)          (0.032)          (0.035)         (0.042)         (0.042)    
##   hs.type: public                                               -65.475***       -65.231***       -64.580***      -65.634***      -65.901***  
##                                                                  (4.748)          (4.731)          (4.743)         (5.397)         (5.395)    
##   hs.type: semiprivate                                          -67.402***       -66.281***       -65.244***      -65.104***      -64.033***  
##                                                                  (4.210)          (4.204)          (4.243)         (4.874)         (4.905)    
##   gender: male/female                                                             14.514***        13.445***       12.475**        12.184**   
##                                                                                   (3.658)          (3.707)         (4.263)         (4.263)    
##   enrollment.score                                                                                  0.134           0.141           0.146     
##                                                                                                    (0.076)         (0.088)         (0.088)    
##   history                                                                                                           0.003           0.003     
##                                                                                                                    (0.035)         (0.035)    
##   hs.location: other_region/capital_city                                                                                           -9.270     
##                                                                                                                                    (5.015)    
## ----------------------------------------------------------------------------------------------------------------------------------------------
##   R-squared                                      0.063            0.203            0.209            0.210           0.212           0.214     
##   adj. R-squared                                 0.062            0.202            0.207            0.208           0.209           0.210     
##   sigma                                         86.112           79.455           79.156           79.114          78.046          77.983     
##   F                                            130.425          165.453          128.962          103.897          66.557          57.629     
##   p                                              0.000            0.000            0.000            0.000           0.000           0.000     
##   Log-likelihood                            -11489.694       -11331.307       -11323.448       -11321.903       -8626.190       -8624.474     
##   Deviance                                14489469.061     12323068.799     12224441.218     12205141.077     9057579.026     9036802.711     
##   AIC                                        22985.387        22672.615        22658.897        22657.806       17268.379       17266.948     
##   BIC                                        23002.123        22700.508        22692.369        22696.857       17310.853       17314.731     
##   N                                           1956             1956             1956             1956            1494            1494         
## ==============================================================================================================================================

Multivariate Analysis

Talk about some of the relationships you observed in this part of the investigation. Were there features that strengthened each other in terms of looking at your feature(s) of interest?

Private high schools have a higher median for the ratio toeic/spanish. The variance across the groups seems to be about the same with semiprivate type of high school having the greatest variation for the middle 50% of enrollments.

Holding spanish test results constant, enrollments coming from a private high school get consistently higher toeic results than enrollments coming from a public or semiprivate high school.

Were there any interesting or surprising interactions between features?

Even though the impact of private high schools on toeic results is high, this difference is not so noticeable for enrollments coming from high schools located outside of the capital city.

OPTIONAL: Did you create any models with your dataset? Discuss the strengths and limitations of your model.

Yes, I created a linear model starting from the toeic test results and spanish test results.

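The variables in the linear model account for 21.4% of the variance in the toeic test results. Hs.type improved the model considerably (R^2 rose from 0.063 to 0.203), and gender added a small but significant improvement (0.209). The addition of enrollment.score, history and hs.location improved the R^2 value only marginally (to 0.214), and none of their coefficients is significant at the 5% level, which is coherent with what we saw in the plots. Note also that m5 and m6 are fitted on 1,494 observations instead of 1,956, because rows with a missing history score are dropped, so their log-likelihood, AIC and BIC are not directly comparable with those of m1 to m4.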
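In the table above, N drops from 1956 to 1494 once history enters the model, so the fit statistics of the larger models refer to a different sample. One way to compare nested models fairly is to refit them on the same complete rows. The sketch below does this on simulated data; it is not the original analysis, and `toy`, `complete`, `small` and `large` are illustrative names.

```r
set.seed(42)
n <- 200

# Simulated data with a structure loosely similar to the report's variables.
toy <- data.frame(
  spanish = rnorm(n, mean = 680, sd = 50),
  hs.type = factor(sample(c("private", "public", "semiprivate"), n, replace = TRUE)),
  history = rnorm(n, mean = 680, sd = 60)
)
toy$toeic <- 200 + 0.4 * toy$spanish -
  60 * (toy$hs.type != "private") + rnorm(n, mean = 0, sd = 80)
toy$history[sample(n, 50)] <- NA  # 50 rows without a history score

# Keep only the rows that every model can use.
vars <- c("toeic", "spanish", "hs.type", "history")
complete <- toy[complete.cases(toy[, vars]), ]

small <- lm(toeic ~ spanish + hs.type, data = complete)
large <- update(small, ~ . + history)

# Both models now use the same observations, so these are comparable.
print(c(small = nobs(small), large = nobs(large)))
print(anova(small, large))
print(AIC(small, large))
```

The cost of this approach is that the smaller models no longer use all available rows, so it is a trade-off between comparability and sample size.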


Final Plots and Summary

Plot One

Description One

The distribution of the TOEIC results for the enrollments appears to be unimodal, peaking around 475. This is quite low considering that the highest possible score is 990 points.

Plot Two

Description Two

Enrollments coming from private high schools have the highest median TOEIC result. The variance in TOEIC results is larger for enrollments coming from semiprivate high schools. In the case of public high schools, the variance is lower, similar to private high schools, but the median TOEIC result is lower and almost the same as for semiprivate high schools.

Plot Three

Description Three

The plot indicates that a linear model could be constructed to predict enrollments’ TOEIC performance using toeic as the outcome variable and spanish as the predictor variable. Holding spanish results constant, enrollments coming from private high schools get consistently higher toeic results than enrollments coming from public or semiprivate high schools.

Reflection

The enrollments data set contains information on almost 2,000 new students of the Faculty of Business and Economics of the University of Chile enrolled between 2005 and 2010, across 10 variables including scores from the National Higher Education Selection test, as well as variables related to their high schools.

I started by looking at and analysing the behaviour of certain variables within the data set, then I explored some questions of my interest as I kept on making observations on plots. Eventually I explored the TOEIC test results and their relation with the Spanish test scores, and created a linear model to predict TOEIC test results.

There was a clear trend between spanish results and TOEIC results. I was surprised to find out that the high school score didn’t influence the performance on the TOEIC test and that it also had a clear negative correlation with math test results. These two variables have a strong relative weight in the equation for enrollment, so it seems logical to think that enrollments would have high scores in either one or the other.

I was also expecting that enrollments coming from outside of the capital would have much lower scores in the TOEIC test, but location turned out to have only a small effect. Then I realized that private high schools had a strong positive effect on TOEIC results, which makes sense in a developing country like Chile.

The first and most obvious limitation is the missing variables that are inherent to each person’s background, interests and skills. The other limitations of this model relate to the source of the data: the enrollments cover only the period between 2005 and 2010, and these students reflect only a tiny part of the population’s interests and capabilities. Maybe today linguistic skills have evolved under some other correlations, like social exposure, philosophical knowledge, or something else. At the same time, maybe recent high school graduates who are interested in arts have a more direct correlation between English skills and, for example, their sense of aesthetics.

In any case, I would be interested in analysing more recent data to see whether it could be worthwhile to get enrollments from other disciplines to undertake the TOEIC test and hopefully increase the model’s accuracy. In this context, and if the trends still hold, this could be a good input source for public policy: it could be the case that it is worthwhile focusing the nation’s educational budget on languages (Spanish and English) rather than just on Spanish.